ABSTRACT


In principle, the version space approach can be applied to any induction problem.
However, in some cases the representation language for generalizations is so powerful that
(1) some of the update functions for the version space are not effectively computable, and
(2) the version space contains infinitely many generalizations. The class of context-free
grammars is a simple representation that exhibits these problems. This paper presents an
algorithm that solves these problems for context-free grammars. Given a sequence of
strings, the algorithm incrementally constructs a data structure that has almost all the
beneficial properties of a version space. The algorithm is fast enough to
solve small induction problems completely, and it serves as a framework for biases that
permit solving larger problems heuristically. The techniques used to develop the
algorithm may be applied in constructing version spaces for representations (e.g.,
production systems, Horn clauses, And-Or graphs) that include context-free grammars as
special cases.
1. Introduction

The problem addressed here arose in the course of studying how people learn
arithmetic procedures from examples (VanLehn, 1983a; VanLehn, 1983b). Our data
allow us to infer the procedures the subjects have learned and the examples they
received during training (approximately). Thus, the inputs and outputs to the learning
process are known, and the problem is to describe the learning process in detail.
However, because the subjects' learning occurs intermittently over several years, we
are not immediately interested in developing a detailed cognitive simulation of their
learning processes. Even if such a simulation could be constructed, it might be so
complicated that it wouldn't shed much light on the basic principles of learning in this
task domain. Therefore, our initial objective is to find principles that act as a partial
specification of the learning process. The principles we seek take the form of a
representation language for procedures and some inductive biases that postdict the
procedures learned by our subjects. More precisely, our problem is:


• Given

1. a training sequence, consisting of examples of a procedure being
executed, and

2. a set of observed procedures, represented in some informal language
(i.e., English),

• find

1. a representation language for procedures, and

2. a set of inductive biases, expressed as predicates on expressions in
the representation language,

• such that the set of all procedures that are consistent with the examples
and preferred by the biases

1. includes the observed procedures, and

2. excludes implausible procedures (e.g., ones that never halt).

This method for studying the structure of mental representations and processes has
much to recommend it (VanLehn, Brown & Greeno, 1984; Fodor, 1975), but here we
wish to discuss only the technical issues involved in implementing it. The central
technical problem is calculating the sets mentioned above. The calculation must be
done repeatedly, once for each combination of representation language, biases and
training sequence. Although the calculations could be done by hand, it is probably
easier to program a computer to perform them. Rather than build one program that
could handle all combinations, or one program for each combination, we chose a
hybrid approach.


The approach is to build a different program for each representation language.
The programs are induction programs, in that they take a sequence of training
examples and calculate expressions in the representation language that are
generalizations of those examples. The inducers are unbiased, in that they produce all
expressions in the language consistent with their inputs. An unbiased inducer provides
a framework on which we can install explicit biases in an attempt to fit its output to
the data. The advantage of this approach is that tuning an unbiased inducer is much
easier than building a different biased inducer for each set of biases. The main
technical problem of implementing this approach is devising an unbiased inducer for
each of the hypothesized representation languages.

It is very important to understand that these inducers are merely tools for
generating certain sets that we are interested in studying. They are not meant to be
models of the students' learning processes.

This approach works fine for some representation languages, but not for others.
Some procedure representation languages (e.g., those used by Anderson (1983) and
VanLehn (1983c)) are based on recursive goal hierarchies that are isomorphic to
context-free grammars.[1] For several reasons, it is impossible to construct an inducer
that produces the set of all context-free grammars consistent with a given training set.
First, such a set would be infinite. Second, the standard technique for representing
such a set, Mitchell's version space technique (Mitchell, 1982), seems inapplicable
because the crucial "more-specific-than" relationship is undecidable for context-free
grammars.[2] The proofs for these points will be presented later. Although we could
have abandoned exploration of procedure representation languages with recursive goal
hierarchies, we chose instead to attack the subproblem of finding a suitable induction
algorithm for context-free grammars.

The impossibility of an unbiased inducer means that a biased one must be
employed as the framework on which hypothesized biases are installed for testing their
fit to the data. Because we will not be able to test the fit with the built-in bias
removed, the built-in bias must be extremely plausible a priori. Moreover, there must
be an algorithm for calculating the set of grammars consistent with it, and that set
must be finite.

We found such a bias, and called it "reducedness." A grammar is reduced if
removing any of its rules makes it inconsistent with the training examples. Later, the
plausibility of reducedness will be argued for, and more importantly, it will be proved
that there are only finitely many reduced grammars consistent with any given training
sequence. This proof is one of the main results presented in this paper.

The proof contains an enumerative algorithm for generating the set of reduced
grammars consistent with a training sequence, but the algorithm is far too slow to be
used. In order to experiment with biases, we need an algorithm that can take a
training sequence of perhaps a dozen examples, and produce a set of reduced
grammars in a day or less.


[1] A context-free grammar is a set of rewrite rules, similar to a simple production system. The next
section gives precise definitions of the relevant terms from formal language theory.

[2] The version space technique is explained in the next section.


The obvious candidate for a faster algorithm is Mitchell's version space strategy
(Mitchell, 1982). Applying the strategy seems to involve conquering the undecidability of
the "more-specific-than" relationship for grammars. However, we discovered that it was
possible to substitute a decidable relationship for "more-specific-than" and thereby
achieve an algorithm that has almost all the beneficial properties of the version space
technique. In particular, it calculates a finite, partially ordered set of grammars that
can be represented compactly by the maximal and minimal grammars in the order.
The set, unfortunately, is not exactly the set of reduced grammars, but it does
properly contain the set of reduced grammars. We call it the derivational version
space.

The derivational version space satisfies our original criterion: it is a set of
consistent grammars which is arguably a superset of the set of grammars qua
procedures that people learn. Moreover, the algorithm for calculating it is fast enough
that small training sequences can be processed in a few hours, and the structure of
the algorithm provides several places for installing interesting biases. The derivational
version space is the second result to be presented in the paper.

The main interest for machine learning researchers lies in the generality of the
techniques we used. The reducedness bias can be applied directly to many
representation languages. For instance, an expression in disjunctive normal form (i.e.,
a disjunction of conjunctions) is reduced if deleting any of its disjuncts makes the
expression inconsistent with the training examples. The finiteness result for reduced
grammars suggests that sets of reduced expressions in other representations are also
finite and effectively computable.[3] Moreover, the technique of substituting an easily
computed relation for the more-specific-than relation suggests that such sets of
reduced expressions can be efficiently computed using the derivational version space
strategy.

Indeed, the fact that substituting another relation for more-specific-than leads to
a  useful  extension  of  the  version  space  strategy  suggests  looking  for  other  relationships 
that  provide  the  benefits  of  version  spaces  without  the  costs.  This  idea  is  independent 
of  the  idea  of  reducedness.  Both  ideas  may  be  useful  outside  the  context  of 
grammar  induction. 

There are four main sections to this paper. The first introduces the relevant
terminology on grammars, grammar induction and version spaces. The second
discusses reducedness and the finiteness of the set of grammars consistent with the
examples. The third discusses the derivational version space. The fourth presents the
induction algorithm for this structure, and demonstrates the results of incorporating
certain biases into it. The concluding section speculates on the larger significance of
this work.


[3] It might be argued that although these ideas can be applied to other representation languages, one would
not want to. However, reducedness is already obeyed by all constructive induction programs that we are
familiar with, including, e.g., inducers by Quinlan (1985), Michalski (1983) and Vere (1975).
Reducedness is not usually obeyed by enumeration-based induction algorithms, such as those found in
the literature on language identification in the limit (Osherson, Stob & Weinstein, 1985). Apparently, the
designers of constructive inducers consider it natural for an induced generalization to include only
parts (e.g., rules, disjuncts) that have some support in the data. Reducedness is a precise statement of ...
2. Terminology


2.1  Introduction  to  grammars  and  grammar  induction 

A grammar is a finite set of rewrite rules. A rule is written α → β, where α and
β are strings of symbols. Grammars are used to generate strings by repeatedly
rewriting an initial string into longer and longer strings. In this article, the initial string
is always "S". For instance, the following grammar

S → b
S → aS

generates the string "aab" via two applications of the second rule and one application
of the first rule:

S → aS → aaS → aab

Such a sequence of rule applications is called a derivation. There are two kinds of
symbols in grammars. Terminals are symbols that may appear in the final string of a
derivation, whereas nonterminals are not allowed to appear in final strings. In the
above grammar, a and b are terminals, and S is a nonterminal.


The grammar induction problem is to infer a grammar that will generate a given
set of strings.[4] The set of strings given to the learner is called the presentation. It
always contains strings that the induced grammar should generate (called positive
strings) and it may or may not contain strings that the induced grammar should not
generate (called negative strings). For instance, given the presentation

- a, + ab, + aab, - ba

the grammar given earlier (call it grammar 1) could be induced because it generates
the two positive strings, "ab" and "aab", and it cannot generate the two negative
strings, "a" and "ba". A grammar is said to be consistent with a presentation if it
generates all the positive strings and none of the negative strings.[5]


There are usually many grammars consistent with a given presentation. For
instance, here are two more grammars consistent with the presentation mentioned above:

Grammar 2
S → A
A → b
A → aA

Grammar 3
S → Sb
Sb → ab
S → aa


[4] Grammar induction is studied in at least three fields: philosophy, linguistics and artificial intelligence.
For reviews from the viewpoints of each, see respectively Osherson, Stob & Weinstein (1985), Pinker
(1979), and Langley & Carbonell (1986). In addition, Cohen & Feigenbaum (1983) give an excellent
overview.

[5] Some authors use "deductively adequate" (Horning, 1969) or "consistent and complete" (Michalski,
1983) for the same concept. We use the term "consistent" in order to bring the terminology of this
paper into line with the terminology of Mitchell's (1982) work on version spaces.
Grammar 2 is equivalent to grammar 1 in that it generates exactly the same set of
strings: {b, ab, aab, aaab, aaaab, ...}. The set of all strings generated by a
grammar is called the language generated by the grammar. The language in this
case is an infinite set of strings. However, languages can be finite. The language
generated by Grammar 3 is the finite set {ab, aa, aab}.
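The equivalence of grammars 1 and 2 can be spot-checked up to a length bound. The following is a minimal sketch, not part of the original paper. It assumes an encoding in which a rule is a (left, right) pair of Python strings, uppercase letters are nonterminals and lowercase letters are terminals. The names `language_up_to` and `consistent` are hypothetical. The search relies on the grammar being context-free with no empty right sides.

```python
from collections import deque

def language_up_to(rules, max_len, start="S"):
    """Strings of length <= max_len generated by an epsilon-free context-free grammar."""
    seen = {start}
    queue = deque([start])
    strings = set()
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, or None if the form is a final string.
        idx = next((k for k, c in enumerate(form) if c.isupper()), None)
        if idx is None:
            strings.add(form)
            continue
        for left, right in rules:
            if left == form[idx]:
                new = form[:idx] + right + form[idx + 1:]
                # Without empty right sides a form never shrinks, so long forms are dropped.
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
    return strings

def consistent(rules, positives, negatives):
    """True if the grammar generates every positive string and no negative string."""
    bound = max(len(s) for s in list(positives) + list(negatives))
    language = language_up_to(rules, bound)
    return set(positives) <= language and not (set(negatives) & language)

GRAMMAR_1 = [("S", "b"), ("S", "aS")]
GRAMMAR_2 = [("S", "A"), ("A", "b"), ("A", "aA")]

print(sorted(language_up_to(GRAMMAR_1, 4), key=len))
print(consistent(GRAMMAR_2, ["ab", "aab"], ["a", "ba"]))
```

Agreement up to a bound is only evidence of equivalence, not a proof of it.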

Grammar induction is considerably simpler if restrictions are placed on the class
of grammars to be induced. Classes of grammars are often defined by specifying a
format for the grammars that are members of the class. For instance, grammars 1
and 2 obey the format restriction that the rules have exactly one nonterminal as the
left side. Grammars having this format are called context-free grammars. Grammar 3
is not a context-free grammar.
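The format restriction is easy to test mechanically. The sketch below is illustrative only and uses the same assumed encoding of rules as (left, right) string pairs with uppercase nonterminals. The function name `is_context_free` is hypothetical, and Grammar 3 is written as it appears in the text above.

```python
def is_context_free(rules):
    """True if every rule has exactly one nonterminal as its left side."""
    return all(len(left) == 1 and left.isupper() for left, _ in rules)

GRAMMAR_1 = [("S", "b"), ("S", "aS")]
GRAMMAR_3 = [("S", "Sb"), ("Sb", "ab"), ("S", "aa")]

print(is_context_free(GRAMMAR_1))
print(is_context_free(GRAMMAR_3))
```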


2.2 Version spaces

One of the most widely studied forms of machine learning is learning from
examples, or induction, as it is more concisely called. The following is a standard
way to define an induction problem:


• Given

1. A representation language for generalizations;

2. A predicate of two arguments, a generalization and an instance, that
is true if the generalization matches the instance;

3. A set of instances, where an instance is marked "positive" if it
should be matched by the induced generalizations, and "negative" if
it should not;

4. A set of biases that indicate a preference order for generalizations.

• Find: One or more generalizations that are

1. consistent with the instances, and
2. preferred by the biases,

where "consistent" means that the generalization matches all the positive
instances and none of the negative instances.

This formulation is deliberately vague in order to encompass many specific induction
problems. For instance, the instances may be ordered. There may be no negative
instances. There may be no biases, or biases that rank generalizations on a
numerical scale, or biases that partially order the set of generalizations. Much work in
machine learning is encompassed by this definition.
Mitchell defines a version space to be the set of all generalizations consistent
with a given set of instances. This is just a set, with no other structure and no
associated algorithm. However, Mitchell also defines the version space strategy to be a
particular induction technique, based on a compact way of representing the version
space. Although popular usage of the term "version space" has drifted, this paper
will stick to the original definitions.

The central idea of the version space strategy is that the space of
generalizations defined by the representation language can be partially ordered by
generality. One can define the relation Generalizes(x,y) in terms of the matching
predicate:

Definition: Generalizes(x,y) is true if and only if the set of instances
matched by x is a superset of the set of instances matched by y.

Note that the Generalizes relationship is defined in terms of the denotations of
expressions in the representation language, and not the expressions themselves. This
will become important later, when it is shown that the Generalizes relation is
undecidable for context-free grammars.
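When the set of instances is finite, Generalizes can be computed directly from denotations. The sketch below is not from the paper. It assumes a toy representation language of attribute tuples with "?" as a wildcard, and the names `matches`, `generalizes` and `INSTANCES` are hypothetical. For grammars the instance space (all strings) is infinite, so this enumeration is not available.

```python
from itertools import product

def matches(pattern, instance):
    """A pattern matches an instance if every non-wildcard attribute agrees."""
    return all(p == "?" or p == v for p, v in zip(pattern, instance))

def generalizes(x, y, instances):
    """Generalizes(x, y): every instance matched by y is also matched by x."""
    return all(matches(x, i) for i in instances if matches(y, i))

COLORS = ["red", "blue"]
SHAPES = ["square", "circle"]
INSTANCES = list(product(COLORS, SHAPES))

print(generalizes(("?", "?"), ("red", "?"), INSTANCES))
print(generalizes(("red", "?"), ("?", "?"), INSTANCES))
```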

It is simple to show that the Generalizes relation partially orders the space of
generalizations. Thus, no matter what the specific induction problem may be, one can
always imagine its answer as lying somewhere in a vast tangled hierarchy which rises
from very specific generalizations that cover only a few instances, all the way up to
generalizations that cover many instances.

Given a presentation, the version space for that presentation will also be partially
ordered by the Generalizes relation. Given some mild restrictions (i.e., that there are
no infinite ascending or descending chains in the partial order), the version space has
a subset of maximal elements and a subset of minimal elements. The maximal set is
called G because it contains the set of maximally general generalizations. The minimal
set is called S because it contains the maximally specific generalizations. The pair
[S, G] can be used to represent the version space. Mitchell proved that

Given a presentation, x is in the version space for that presentation if and
only if there is some g in G such that g Generalizes x and there is some s
in S such that x Generalizes s.

Three algorithms are usually discussed in connection with the [S, G] representation
of version spaces:

• Update(i, [S, G]) --> [S', G']

The Update function takes the current version space boundaries and an
instance that is marked as either positive or negative. It returns
boundaries for the new version space. If the instance makes the version
space empty (i.e., there is no generalization that is consistent with the
presentation, as when the same instance occurs both positively and
negatively), then some marker, such as Lisp's NIL, is returned. The Update
algorithm is the induction algorithm for version space boundaries. Its
implementation depends on the representation language.
• DoneP([S, G]) --> true or false

Unlike many induction algorithms, it is possible to tell when further
instances will do no good because the version space has changed as
much as it is going to. DoneP is implemented by a test for set equality:
S = G.


• Classify(i, [S, G]) --> +, -, or ?

Classify takes an instance that is not marked positively or negatively, and the
version space boundaries. It returns "+" if the instance would be
matched by all the generalizations in the version space. It returns "-" if it
would be matched by no generalizations. It returns "?" otherwise.
Classify is useful for experiment design. If instances are marked by some
expensive-to-use teacher (e.g., a gene sequenator, or a proton collider),
then one wants to check that an instance will cause some change in the
version space before having the teacher decide whether it is a positive or
negative instance. Only instances that receive "?" from Classify are worth
submitting to the teacher. Classify is implemented as follows: if all s in S
match i, then return "+"; else if no g in G matches i, then return "-"; else
return "?".
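The DoneP and Classify definitions above translate almost word for word into code. The following is a hedged sketch rather than Mitchell's implementation. It assumes a toy matching predicate over attribute tuples with "?" as a wildcard, and the names `matches`, `done_p` and `classify` are hypothetical.

```python
def matches(pattern, instance):
    """Toy matching predicate: "?" matches any attribute value."""
    return all(p == "?" or p == v for p, v in zip(pattern, instance))

def done_p(S, G):
    """Induction is done when the two boundary sets are equal."""
    return set(S) == set(G)

def classify(instance, S, G):
    """Return "+", "-" or "?" for an unmarked instance."""
    if all(matches(s, instance) for s in S):
        return "+"   # matched by every generalization in the version space
    if not any(matches(g, instance) for g in G):
        return "-"   # matched by no generalization in the version space
    return "?"       # worth submitting to the teacher

S = [("red", "square")]
G = [("red", "?")]

print(classify(("red", "square"), S, G))
print(classify(("blue", "circle"), S, G))
print(classify(("red", "circle"), S, G))
print(done_p(S, G))
```

Only the matching predicate depends on the representation language, which is why these two functions come for free once Update is available.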

Applying the version space strategy to a representation language means that one must
devise only an appropriate Update function, because the Classify and DoneP functions
come for free with the strategy. This is sometimes cited as the chief advantage of
the version space approach. In our work on skill acquisition, we make only a little
use of them. Our main reason for preferring the version space strategy over other
induction strategies is that it computes exactly the set we need, the version space,
and represents it compactly.


3.  Reduced  version  spaces 

The first problem encountered in applying the version space strategy to grammar
induction is that the version space will always be infinite. This does not
necessarily imply that the version space boundaries will be infinite; a finite S and G
can represent an infinite version space. However, for grammars, the boundaries also
turn out to be infinite. To begin, let us consider a well-known theorem about
grammar induction, which is:

Theorem 1: For any class of grammars that includes grammars for all
the finite languages, there are infinitely many grammars in the class that are
consistent with any given finite presentation.

That is, the version space is infinite for any finite presentation.

This theorem has a significance outside the context of version space technology.
For instance, it has been used to justify nativist approaches to language acquisition
(Pinker, 1979). This section is written to address both the concerns of version space
technology and the larger significance of this theorem.
3.1. Normal version spaces are infinite

Three ways to prove the theorem will be presented. Simply amending the
statement of the theorem to prevent the use of each of the proof techniques yields a
new theorem, which is one of the results of this article.

All three proofs employ mathematical induction. The initial step in all the proofs
is the same. Because the class of grammars contains grammars for all finite
languages, and the positive strings of the presentation constitute a finite language, we
can always construct at least one grammar that is consistent with the presentation.
This grammar initializes the inductions. The inductive steps for each of the three
proofs are, respectively:


1. Let α be any string not in the presentation. Add the rule S → α to the
grammar. The new grammar generates exactly the old grammar's
language plus α as well. Since the old grammar was consistent with the
presentation, and α does not appear in the presentation, the new grammar
is also consistent with the presentation. Because there are infinitely many
strings α that are not in the presentation, infinitely many different grammars
can be constructed this way. One might object that the rule S → α
may be in the grammar already. However, because a grammar has finitely
many rules, there can be only finitely many such α, and these can be
safely excluded when the α required by the proof is selected.

2. Let A be a nonterminal in the grammar, and let B be a nonterminal not in
the grammar. Add the rule A → B to the grammar. For some or all of
the rules that have A as the left side, add a copy of the rule to the
grammar with B substituted for A. These additions create new grammars
that generate exactly the same strings as the original grammar. Because
the original grammar is consistent with the presentation, so are the new
grammars. This process can be repeated indefinitely, generating an infinite
number of grammars consistent with the presentation.

3. Form a new grammar by substituting new nonterminals for every
nonterminal in the old grammar (except S). Create a union grammar
whose rules are the union of the old grammar's rules and the new
grammar's rules. The union grammar generates exactly the same language
as the original grammar, so it is consistent with the presentation. The
union process can be repeated indefinitely, yielding an infinite set of
grammars consistent with the presentation.
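The first construction can be made concrete. The sketch below is an illustration under stated assumptions, not the paper's procedure. It starts from a trivial grammar with one rule S → w per positive string, so every grammar it produces contains only rules of the form S → terminal string and its language is simply the set of right sides. The names `trivial_grammar`, `fresh_strings`, `proof1_grammars` and `language` are hypothetical.

```python
from itertools import count, islice, product

def trivial_grammar(positives):
    """One rule S -> w per positive string w; its language is exactly the positives."""
    return frozenset(("S", w) for w in positives)

def fresh_strings(alphabet, excluded):
    """Enumerate terminal strings over `alphabet`, shortest first, skipping `excluded`."""
    for n in count(1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if s not in excluded:
                yield s

def proof1_grammars(positives, negatives, alphabet="ab"):
    """Yield an unbounded sequence of distinct grammars consistent with the presentation."""
    grammar = trivial_grammar(positives)
    for alpha in fresh_strings(alphabet, set(positives) | set(negatives)):
        grammar = grammar | {("S", alpha)}   # add the rule S -> alpha
        yield grammar

def language(grammar):
    """Every rule here is S -> terminal string, so the language is the set of right sides."""
    return {right for _, right in grammar}

FIRST_FIVE = list(islice(proof1_grammars(["ab", "aab"], ["a", "ba"]), 5))
for g in FIRST_FIVE:
    print(sorted(language(g)))
```

The generator never terminates on its own, which mirrors the claim of the theorem; `islice` takes a finite prefix for inspection.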

It is hard to imagine why a machine or human would seriously entertain the
grammars constructed above. The grammars of the last two proofs are particularly
worthless as hypotheses, because they are notational variants of the original grammar.
In a moment, we will add restrictions to the class of grammars that will bar such
irrational grammars.

It was mentioned that an infinite version space can, in principle, be represented
by finite boundaries. Unfortunately, this does not work for grammars. The second
two proofs above will generate infinitely many grammars that generate exactly the
same language as the initial grammar. If the initial grammar is from S, then S can
be made infinite; similarly, G can be made infinite. The G set can also be made
infinite by the first proof above. These comments prove the following theorem:
Theorem 2: If the representation language for generalizations specifies a
class of grammars that includes grammars for all finite languages, then for
any finite presentation, the version space boundaries, S and G, are each
infinite.


3.2.  Reducedness  makes  the  version  space  finite 

One way to make the version space finite is to place restrictions on the
grammars to be included in it. As some of these restrictions are most easily stated
as restrictions on the form of grammar rules, we will limit our attention to context-free
grammars, although the same general idea works for some higher order grammars as
well (as shown in the appendix). The first restriction blocks the grammars produced
by the second proof:

Definition: A context-free grammar is simple if (1) no rule has an empty
right side,[6] (2) if a rule has just one symbol on its right side, then the
symbol is a terminal, and (3) every nonterminal appears in a derivation of
some string.

The class of simple grammars can generate all the context-free languages. Hopcroft
and Ullman (1979) prove this (Theorem 4.4) by showing how to turn an arbitrary
context-free grammar into a simple context-free grammar. For our purposes, the
elimination of rules of the form A → B, where both A and B are nonterminals,
blocks the second proof.
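The three conditions of simplicity can be checked mechanically. The following is a sketch under an assumed encoding (rules as (left, right) string pairs, uppercase nonterminals, start symbol "S"). The name `is_simple` is hypothetical. Condition (3) is tested in the standard way: a nonterminal appears in a derivation of some string if it can derive a terminal string and can be reached from S through rules whose symbols can all do so.

```python
def is_simple(rules, start="S"):
    """Check the three conditions for a simple context-free grammar."""
    # (1) No rule has an empty right side.
    if any(right == "" for _, right in rules):
        return False
    # (2) A one-symbol right side must be a terminal.
    if any(len(right) == 1 and right.isupper() for _, right in rules):
        return False
    # (3a) Nonterminals that can derive some terminal string (fixpoint).
    productive = set()
    changed = True
    while changed:
        changed = False
        for left, right in rules:
            if left not in productive and all(
                c in productive or not c.isupper() for c in right
            ):
                productive.add(left)
                changed = True
    # (3b) Nonterminals reachable from the start via rules that can finish.
    usable = [
        (left, right) for left, right in rules
        if left in productive
        and all(c in productive or not c.isupper() for c in right)
    ]
    reachable, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for left, right in usable:
            if left == current:
                for c in right:
                    if c.isupper() and c not in reachable:
                        reachable.add(c)
                        frontier.append(c)
    nonterminals = {left for left, _ in rules} | {
        c for _, right in rules for c in right if c.isupper()
    }
    return start in productive and nonterminals <= reachable

print(is_simple([("S", "b"), ("S", "aS")]))
print(is_simple([("S", "A"), ("A", "b"), ("A", "aA")]))
```

Grammar 2 fails only because of the rule S → A, which is the kind of rule the second proof depends on.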

Proofs 1 and 3 can be blocked by requiring that all the rules in an induced
grammar be necessary for the derivation of some positive string in the given
presentation. To put this formally:

Definition: Given a presentation P, a grammar is reduced if it is consistent
with P and there is no proper subset of its rules that is consistent with P.

Removing rules from a grammar will only decrease the size of the language generated,
not increase it. So removing rules from a grammar will not make it generate a
negative string that it did not generate before. However, deleting rules may prevent
the grammar from generating a positive string, thus making it inconsistent with the
presentation. If any deletion of rules causes inconsistency, the grammar is reduced.
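Because deleting rules can only shrink the language, it is enough to test the deletion of each single rule: if some proper subset of the rules were consistent, then the grammar with one of the missing rules removed would be consistent too. The sketch below relies on that observation. It is illustrative only, assumes the (left, right) string encoding with uppercase nonterminals and no empty right sides, and uses the hypothetical names `generates`, `consistent` and `is_reduced`.

```python
from collections import deque

def generates(rules, target, start="S"):
    """True if an epsilon-free context-free grammar derives `target` from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        idx = next((k for k, c in enumerate(form) if c.isupper()), None)
        # Skip final strings and forms whose terminal prefix already disagrees.
        if idx is None or form[:idx] != target[:idx]:
            continue
        for left, right in rules:
            if left == form[idx]:
                new = form[:idx] + right + form[idx + 1:]
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(new)
    return False

def consistent(rules, positives, negatives):
    """All positive strings are generated and no negative string is."""
    return (all(generates(rules, s) for s in positives)
            and not any(generates(rules, s) for s in negatives))

def is_reduced(rules, positives, negatives):
    """Consistent, and deleting any single rule destroys consistency."""
    rules = list(rules)
    if not consistent(rules, positives, negatives):
        return False
    return all(
        not consistent(rules[:k] + rules[k + 1:], positives, negatives)
        for k in range(len(rules))
    )

POS, NEG = ["ab", "aab"], ["a", "ba"]
GRAMMAR_1 = [("S", "b"), ("S", "aS")]

print(is_reduced(GRAMMAR_1, POS, NEG))
print(is_reduced(GRAMMAR_1 + [("S", "bb")], POS, NEG))
```

The second call adds a rule of the kind used in proof 1, and the resulting grammar is reducible.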

In proof 1, adding the rule S → α creates a new grammar that is reducible.
Similarly, the union grammar formed by proof 3 is reducible. This leads to the
theorem:


Theorem 3: Given a finite presentation, there are finitely many reduced
simple context-free grammars consistent with that presentation.


[6] This reduces the expressive power of the class somewhat, because a grammar without such epsilon
rules, as they are commonly called, cannot generate the empty string.
The proof of this theorem is presented in the appendix. We call the version space of
reduced, simple grammars a reduced version space for grammars.

Any finite partially ordered set has a finite subset of minimal elements and a
finite subset of maximal elements. Define the reduced G and S as the maximal and
minimal sets, respectively, of the reduced version space under the partial order
established by the Generalizes relation. It follows immediately that

Theorem  4:  Given  a  finite  presentation,  the  reduced  G  and  S  sets  are 
each  finite. 


3.3.  The  behavior  of  reduced  version  spaces 

This section describes some of the ways in which a reduced version space
differs from a normal version space.

Normally, a version space can only shrink as instances are presented. As each
instance is presented, generalizations are eliminated from the version space. With a
reduced version space, negative instances cause shrinking, but positive instances
usually expand the reduced version space. To see why, suppose that at least one of
the grammars in the current version space cannot generate the given positive string.
There are usually several ways to augment the grammar in order to generate the
string. For instance, one could add the rule S → α, where α is the string. Or one
could add the rules S → Aβ and A → γ, where α = γβ. Each way of augmenting
the current grammar in order to generate the new string contributes one grammar to
the new version space. So positive strings cause the reduced version space to
expand.

Because presenting a positive string causes the reduced version space to
expand, the equality of S and G no longer implies that induction is done. That
is, the standard implementation of DoneP doesn't work. We conjecture that
Gold's (1967) theorems would allow one to show that there is no way to tell
when induction of a reduced version space is done.

The S set for the reduced version space turns out to be rather boring. It
contains only grammars that generate just the positive strings in the
presentation. We call such grammars "trivially specific" because they do
nothing more than record the positive presentation. The version space Update
algorithm described below does not bother to maintain the S set, although it
could. Instead, it maintains P+, the set of positive strings seen so far. In
order to illustrate the efficiency gained by this substitution, consider the
Classify function, whose normal definition is: where i is an instance to be
classified, if all s in S match i, then return "+", else if no g in G matches
i, then return "-", else return "?". With P+, the first clause of the
definition becomes: if i is in P+, then return "+". Because S contains only the
trivially specific grammars, these two tests are equivalent. Clearly, it is
more efficient to use P+ instead of S. Similar efficiencies are gained in the
implementation of the Update algorithm. Nowlan (1987) presents an alternative
solution to this problem with some interesting properties.
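
The modified Classify function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name `classify`, the argument order, and the `generates` callback (a stand-in for whatever parser tests whether a grammar generates a string) are hypothetical and do not come from the paper.

```python
def classify(instance, positives, G, generates):
    """Three-way classification that uses P+ in place of the S set.

    positives : set of positive strings seen so far (P+)
    G         : iterable of maximal grammars
    generates : hypothetical callback, generates(grammar, string) -> bool
    """
    if instance in positives:
        # Equivalent to "all s in S match i" when S is trivially specific.
        return "+"
    if not any(generates(g, instance) for g in G):
        return "-"
    return "?"
```

In this sketch the first test is a set lookup, whereas the original test would require parsing the instance with every grammar in S.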
3.4. Why choose reducedness for an inductive bias?

The basic idea of reducedness applies to other representation languages. For
instance, suppose the representation is a first order logic whose expressions
are in disjunctive normal form (i.e., a generalization is one large
disjunction, with conjunctions inside it). The rules in a grammar are like
disjuncts in a disjunction. Therefore, a disjunctive normal form expression is
reduced if removing any disjunct from it makes it inconsistent with the
presentation. We conjecture that the reduced version space for disjunctive
normal forms will turn out to be finite. There may be a general theorem about
reducedness and finiteness that would apply, at the knowledge level perhaps
(Newell, 1982; Dietterich, 1986), to many representation languages.

From the machine learning literature, it seems that reducedness is a "common
sense" restriction to place on induction. All heuristic concept induction
programs with which we are familiar (e.g., Michalski, 1983; Vere, 1975;
Quinlan, 1986) consider only reduced concepts. Reducedness seems to be such a
rational restriction that machine learning researchers adopt it implicitly.

There are other ways to restrict grammars so that there are only finitely many
grammars consistent with a finite presentation. For instance, there are only
finitely many simple, trivially specific grammars consistent with a finite
presentation. However, the restriction to reduced, simple grammars seems just
strong enough to block the procedures that produce an infinitude of grammars
without being so strong that interesting grammars are blocked as well. This
makes it an ideal restriction to place on version spaces for grammars. The
chief advantage of version spaces is that they contain all the generalizations
consistent with the presentation. In order to retain the basic spirit of
version spaces while making their algorithms effective, one should add the
weakest restrictions possible. For grammars, the conjunction of reducedness
with simplicity seems to be such a restriction.


4.  Applying  the  version  space  strategy  to  reduced  version  spaces 

The proof of theorem 3 puts bounds on the size of reduced grammars and their
rules. In principle, the reduced version space could be generated by
enumerating all grammars within these bounds. However, such an algorithm would
be too slow to be useful. This section discusses a technique that yields a much
faster induction algorithm.

The version space strategy is the obvious choice for calculating a reduced
version space, but it cannot, we believe, be applied directly. The problem is
that the version space strategy is based on the Generalizes relationship, which
is defined by a superset relationship between the denotations of two
generalizations. If the generalizations are grammars, then the denotations are
exactly the languages generated by the grammars. Implementing Generalizes(x,y)
is equivalent to testing whether the language generated by x includes the
language generated by y. This test is undecidable for context-free grammars or
grammars of higher orders (Hopcroft & Ullman, 1979, theorem 8.12). This means
that there is no algorithm for implementing Generalizes(x,y) over the
context-free grammars.

This result does not prove that the version space strategy is inapplicable,
because only the Update algorithm is required in order to construct a version
space, and there is no proof (yet) that a computable Generalizes is necessary
for a computable Update. On the other hand, we have never seen a version space
Update algorithm that did not call Generalizes as a subroutine, and we have no
idea how to build a Generalizes-free Update algorithm for grammars. So the
undecidability of the Generalizes predicate is a practical impediment, at
least.

The Generalizes predicate may be decidable if its arguments are restricted to
be reduced grammars for the same presentation. If so, then it may be possible
to use Generalizes in an Update algorithm that only works for the reduced
version space, and not the normal version space. This is not an approach that
we explored. Instead, we sought a way to apply the spirit of the version space
strategy while avoiding the troublesome Generalizes predicate entirely.

The "trick" to the version space strategy is using the boundaries of a partial
order to represent a very large, partially ordered set. In principle, this
trick can be based on any partial order, and not necessarily on the partial
order established by Generalizes. This idea led us to seek a partial order that
was "like" Generalizes, and yet computable. Moreover, the partial order had to
be such that there was an Update algorithm for the sets of maximal and minimal
elements in the order.

It was not difficult to find a computable partial order on grammars, but we
never found an Update algorithm that could maintain sets that were the
boundaries of exactly the reduced version space. Instead, we did find one for a
superset of the reduced version space. In particular, we found:

•  A set, called the derivational version space, that is a superset of the
reduced version space and a subset of the version space.

•  A computable predicate, called FastCovers, that is a partial order over
grammars in the derivational version space.

•  An update algorithm for the maximal and minimal elements in FastCovers
of the derivational version space.

This section presents the derivational version space and the FastCovers
relation. The next section presents the Update algorithm.


4.1. The derivational version space

In order to define the derivational version space, it is helpful to define some
ancillary terms first.

A derivation tree is a way to indicate the derivation of a string by a grammar.
(Derivation trees are also called parse trees.) The derivation tree's leaves
are the terminals in the string. The non-leaf nodes of the tree are labelled by
nonterminals. The root node is always labelled by the root nonterminal, S. An
algorithm can "read off" the rules used by examining mother-daughter subtrees.
If the label of the mother is A and the labels of the daughters are B, C and D,
then the rule A → B C D has been applied. This reading off process can be used
to convert derivation trees into a grammar.
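
The reading-off process can be illustrated with a short Python sketch. The encoding is an assumption made for the example, not the paper's: a labelled tree is a nested tuple `(label, daughter, ...)`, a terminal is a plain string, and a rule is a pair `(left side, right side tuple)`. The function name `read_off_rules` is hypothetical.

```python
def read_off_rules(tree):
    """Collect the rules used in a labelled derivation tree.

    A tree is a tuple (label, daughter, ...); each daughter is either a
    terminal (a plain string) or another tree of the same shape.
    """
    label, *daughters = tree
    # The right side lists the daughters' labels, or the terminal itself.
    rhs = tuple(d if isinstance(d, str) else d[0] for d in daughters)
    rules = {(label, rhs)}
    for d in daughters:
        if not isinstance(d, str):
            rules |= read_off_rules(d)
    return rules


# A derivation of "ab" in which S has two daughters labelled A.
example = ("S", ("A", "a"), ("A", "b"))
rules = read_off_rules(example)
```

Applying the function to every tree in a sequence and taking the union of the results gives the grammar for that sequence.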
For simple grammars, derivation trees are constrained to have certain possible
shapes. Simple grammars have no rules of the form A → B, where both A and B
are nonterminals. Therefore, the only nodes in the derivation trees that have
singleton daughters are ones whose daughters are terminals, because the only
rules that can have singleton right sides are those whose right side consists
of a terminal. Let us call trees with this shape simple trees. The definition
of simple grammars makes it impossible for a simple tree to have long,
unbranching chains. Consequently, there are only finitely many unlabelled
simple trees for any given string.

If  a  string  has  more  than  one  element,  then  there  is  more  than  one  unlabelled 
simple  tree.  Given  a  finite  sequence  of  strings,  one  can  calculate  all  possible 
sequences  of  unlabelled  simple  trees  by  taking  the  Cartesian  product  over  the  sets  of 
unlabelled  simple  trees  for  each  string.  Let  us  call  this  set  of  simple  tree  sequences 
the  simple  tree  product  of  the  strings.  Because  there  are  only  finitely  many  unlabelled 
simple  trees  for  each  string,  the  simple  tree  product  will  be  finite. 
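
As a rough illustration, the unlabelled simple trees and the simple tree product can be enumerated as follows. The encoding is an assumption of this sketch: a node is a tuple of its daughters, a terminal is a one-character string, and the function names are hypothetical.

```python
from itertools import product


def compositions(s):
    """Yield every way to cut s into contiguous non-empty pieces."""
    if not s:
        yield ()
        return
    for i in range(1, len(s) + 1):
        for rest in compositions(s[i:]):
            yield (s[:i],) + rest


def daughter_options(piece):
    """A one-symbol piece may be a bare terminal or a node over it."""
    options = simple_trees(piece)
    if len(piece) == 1:
        options = [piece] + options
    return options


def simple_trees(s):
    """All unlabelled simple trees whose leaves spell out s."""
    if len(s) == 1:
        return [(s,)]  # a node whose single daughter is a terminal
    trees = []
    for pieces in compositions(s):
        if len(pieces) < 2:
            continue  # a singleton daughter must be a terminal
        trees.extend(product(*(daughter_options(p) for p in pieces)))
    return trees


def simple_tree_product(strings):
    """Cartesian product of the simple-tree sets, one set per string."""
    return list(product(*(simple_trees(s) for s in strings)))
```

Under this encoding the string "b" has one tree and "ab" has four, so the product for the two strings contains four tree sequences.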

The  definition  of  the  derivational  version  space  can  now  be  stated. 

Definition:  Given  a  set  of  positive  strings,  the  derivational  version  space  is 
the  set  of  grammars  corresponding  to  all  possible  labellings  of  each  tree 
sequence  in  the  simple  tree  product  for  those  strings.  Given  a  set  of 
positive  and  negative  strings,  the  derivational  version  space  is  the  derivational 
version  space  for  the  positive  strings  minus  grammars  that  generate  any  of 
the  negative  strings. 

An example may clarify this definition. Suppose the positive strings are "b"
and "ab." The construction of the derivational version space begins by
considering the simple tree product for the strings. There is one unlabelled
tree for "b." There are four unlabelled trees for "ab." So there are four tree
sequences in the Cartesian product of the trees for "b" and the trees for "ab."
These four tree sequences constitute the simple tree product, which is shown in
figure 4-1. For each of the four tree sequences, the construction process
partitions the nodes in the trees and assigns labels. Figure 4-2 illustrates
how the fourth unlabelled tree sequence is treated. At the top of the figure,
the unlabelled tree sequence is shown with its nodes numbered. Trees 4.1
through 4.5 show all possible partitions of the four nodes, and the labellings
of the trees that result. Because the root nodes of the trees must always
receive the same node label, S, they are given the same number, which forces
them to be in the same partition element, and hence receive the same labelling.
Each of the resulting labelled tree sequences is converted to a grammar. These
grammars are shown in the third column of the figure. The derivational version
space is the union of these grammars, which derive from the fourth tree
sequence, with the grammars from the other tree sequences.
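
The labelling step for the fourth tree sequence can be sketched in Python. The encoding is assumed for illustration only: a numbered tree is a tuple `(node number, daughter, ...)`, both roots carry the number 1, and nonterminal names other than S are drawn from A, B, C in order of the smallest node number in each partition element. All function names are hypothetical.

```python
def set_partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition


def labelling(partition):
    """Map node numbers to labels; the block holding node 1 is S."""
    names = iter("ABCDEFGH")
    label = {}
    for block in sorted(partition, key=min):
        name = "S" if 1 in block else next(names)
        for node in block:
            label[node] = name
    return label


def grammar(trees, partition):
    """Read the rules off a numbered tree sequence under a partition."""
    label = labelling(partition)
    rules = set()

    def walk(tree):
        node, *daughters = tree
        rhs = tuple(d if isinstance(d, str) else label[d[0]] for d in daughters)
        rules.add((label[node], rhs))
        for d in daughters:
            if not isinstance(d, str):
                walk(d)

    for tree in trees:
        walk(tree)
    return rules


sequence_4 = [(1, "b"), (1, (2, "a"), (3, "b"))]
grammars = [grammar(sequence_4, p) for p in set_partitions([1, 2, 3])]
```

The three numbered nodes have five partitions, matching trees 4.1 through 4.5. The partition {1}, {2 3} yields the rules S → b, S → A A, A → a and A → b.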

Figure 4-1: The simple tree product for the presentation "b", "ab".

The motivation for the derivational version space is the following: if a
grammar is going to parse all the positive strings, then there must be a
sequence of simple derivation trees, one for each string. Such a sequence must
be some possible labelling of some possible sequence of unlabelled simple
trees. The derivational version space is constructed from all such sequences,
i.e., from the simple tree product. Consequently, it must somehow represent all
possible grammars, except those grammars which have rules that were not used
during the parsing of those strings. Those grammars are, by definition, the
reducible grammars. So the derivational version space contains all the reduced
grammars. These observations lead to the following theorem, which is proved in
the appendix:

Theorem  5:  Given  a  presentation,  the  derivational  version  space  for  it 
contains  the  reduced  version  space  for  it. 

Usually, the reduced version space is a proper subset of the derivational
version space. That is, the derivational version space often contains reducible
grammars. In the illustration discussed earlier, where the positive strings
(P+) are "b" and "ab," no reducible grammars are generated. However, if P+ is
{"b", "ab", "ab"} or if P+ is {"b", "ab", "abb"}, then many reducible grammars
are generated. In general, if a subset of P+ is sufficient to produce grammars
that will generate all of it, then the derivational version space will contain
reducible grammars.

The following theorem shows that the "version space" component of the name
"derivational version space" is warranted:

Theorem 6: Given a presentation, the derivational version space for it is
contained in the version space for it.

The proof follows from the observation that the grammars in the derivational
version space were constructed so that each positive string has a derivation,
and grammars that generate negative strings are filtered out. Consequently, the
grammars are consistent with the presentation.

Lastly, we note that

Theorem 7: The derivational version space for a finite presentation is
finite.

The proof follows from the earlier observation that the simple tree product is
finite. Because each tree sequence in the product has only finitely many nodes
and there are only finitely many ways to partition a finite set into
equivalence classes, there are only finitely many ways to label each of the
finitely many simple tree sequences. Hence, the derivational version space for
P+ is finite. The derivational version space for the whole presentation is a
subset of the one for P+, so it too is finite.

Figure 4-2: Partitions, labelled trees and grammars of tree sequence 4

The derivational version space is a finite set that contains all the reduced
simple grammars, and moreover, all its members are consistent with the
presentation. This set suffices for the purposes we outlined in the
introduction. It contains all the "plausible" grammars, and it is finite. We
show next that there is a partial order for it that allows a boundary updating
algorithm to exist.


4.2.  The  FastCovers  predicate 

The definition of the partial order is simplified if grammars in the
derivational version space are represented in a special way, as a triple. The
first element of the triple is a sequence of unlabelled simple derivation
trees, with the nodes numbered as in figure 4-1. The second element of the
triple is a partition of the trees' nodes. The third element is the grammar's
rules. For instance, grammar 4.4 of figure 4-2 is represented by the following
triple:

Tree sequence:

(1 b), (1 (2 a)(3 b))

Partition:

{1}, {2 3}

Rules:

S → b
S → A A
A → a
A → b

The triple representation allows the FastCovers relation to be defined as
follows:

Definition: Given two grammars, X and Y, in triple form, grammar X
FastCovers grammar Y if (1) both grammars are labellings of the same tree
sequence (i.e., the first elements of their triples are the same), and (2) the
partition (i.e., second element of the triple) of Y is a refinement of the
partition of X.

For instance, a grammar whose partition is ({1}, {2}, {3}) is FastCovered by
the grammar above; a grammar whose partition is ({1 2}, {3}) is not FastCovered
by the grammar above, nor does it FastCover the grammar above.
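
The definition translates almost directly into code. The following Python sketch assumes a grammar is held as a pair (tree sequence, partition), with the partition given as a list of sets of node numbers; the rules are omitted because they can be derived from the other two elements. The names `refines` and `fast_covers` are hypothetical.

```python
def refines(fine, coarse):
    """True if every block of `fine` lies inside some block of `coarse`."""
    return all(any(block <= big for big in coarse) for block in fine)


def fast_covers(x, y):
    """X FastCovers Y: same tree sequence, and Y's partition refines X's."""
    trees_x, partition_x = x
    trees_y, partition_y = y
    return trees_x == trees_y and refines(partition_y, partition_x)


trees = ((1, "b"), (1, (2, "a"), (3, "b")))
grammar_4_4 = (trees, [{1}, {2, 3}])
bottom = (trees, [{1}, {2}, {3}])
other = (trees, [{1, 2}, {3}])
```

With these values, `fast_covers(grammar_4_4, bottom)` holds, while `grammar_4_4` and `other` are incomparable in either direction, as in the example above. The test is a set-containment check, so no parsing is involved.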

FastCovers is named after Covers, a partial order used in early work on
grammar induction (Reynolds, 1968; Horning, 1969; Pao, 1969). Although we will
not pause to define Covers, it can be shown that FastCovers(x,y) implies
Covers(x,y), but Covers(x,y) does not imply FastCovers(x,y). FastCovers is used
instead of Covers for ordering the derivational version space because it is
faster to compute than Covers and it makes the Update algorithm simpler.

It is simple to show that the FastCovers relationship is transitive and
reflexive, because the refinement relationship is. Moreover, because every
grammar in the derivational version space has a triple form, FastCovers applies
to every pair of grammars in a derivational version space. Thus, FastCovers
partially orders the derivational version space.

A second property of FastCovers, which is needed in showing that the Update
algorithm is correct, is:

Theorem 8: For any two grammars, X and Y, in triple form,
FastCovers(X,Y) implies Generalizes(X,Y).

The proof follows from observing that the refinement relationship between the
nonterminals (= partition elements) of X and the nonterminals of Y establishes
a mapping that takes Y's nonterminals onto X's nonterminals. Every derivation
in grammar Y can be turned into a derivation in grammar X by mapping Y's
nonterminals onto X's nonterminals. Thus, every string that has a derivation in
Y must have a derivation in X as well. So the language generated by Y is a
subset of the language generated by X, i.e., Generalizes(X,Y).

Given a derivational version space, there is always a finite set of maximal
elements in FastCovers and a finite set of minimal elements. The finiteness of
the boundaries follows from the finiteness of the space itself. We will call
the maximal and minimal sets the derivational G and S, respectively. From the
preceding theorem, it follows immediately that

Theorem  9:  The  derivational  G  (S)  includes  the  subset  of  the 
derivational  version  space  that  is  maximal  (minimal)  with  respect  to  the 
Generalizes  relationship. 

Given a derivational [S,G], the FastCovers relationship can be used to
determine whether a given grammar is in the derivational version space
represented by the pair:

Theorem 10: Given a grammar x in triple form and a derivational [G,S],
x is in the derivational version space represented by [G,S] if and only if
there is some g in G such that g FastCovers x, and some s in S such that x
FastCovers s.

The proof of the theorem is in the appendix.
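
The membership test of theorem 10 can be written as a short Python sketch. As an assumption of this example, grammars are pairs (tree sequence, partition) with partitions given as lists of sets, and the function names are hypothetical.

```python
def fast_covers(x, y):
    """Same tree sequence, and y's partition refines x's partition."""
    (trees_x, partition_x), (trees_y, partition_y) = x, y
    return trees_x == trees_y and all(
        any(small <= big for big in partition_x) for small in partition_y
    )


def in_version_space(x, G, S):
    """Theorem 10: x lies between some g in G and some s in S."""
    return any(fast_covers(g, x) for g in G) and any(
        fast_covers(x, s) for s in S
    )


trees = ((1, "b"), (1, (2, "a"), (3, "b")))
G = [(trees, [{1}, {2, 3}])]
S = [(trees, [{1}, {2}, {3}])]
inside = in_version_space((trees, [{1}, {2}, {3}]), G, S)
outside = in_version_space((trees, [{1, 2}, {3}]), G, S)
```

Here `inside` is true and `outside` is false, because no member of this G FastCovers the partition ({1 2}, {3}).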


5.  An  Update  algorithm  for  the  derivational  version  space 

The preceding section discussed the definitions of the structure that we wish
to generate. This section presents the algorithm that generates the structure,
then reports the results of several experiments with it. It begins by
presenting an informal account of what happens as positive and negative strings
are presented.

The derivational version space under FastCovers is a set of partition lattices,
one lattice for each tree sequence in the simple tree product. One can
visualize the space as a loaf of sliced bread, one slice for each tree
sequence. All the FastCovers relationships run inside slices; none cross from
one slice to another. Each slice is a partition lattice. It has one maximal
partition on top and one minimal partition on the bottom. The top partition has
just one element, and the element has all the nodes in the tree sequence for
that slice. The top partition for the tree sequence of figure 4-2 is
({1 2 3}). The bottom partition in each lattice has a singleton partition
element per node in the tree sequence. The bottom partition for the tree
sequence of figure 4-2 is ({1}, {2}, {3}). All the slices/lattices have unique
top and bottom partitions.

If there are no negative instances in the presentation, then G consists of the
top partition in each lattice. As negative instances are presented, the maximal
set for each lattice may descend. Thus, the G set expands and the derivational
version space shrinks as negative strings are presented. The S set always
consists of the bottom partition in each lattice. Presentation of negative
instances does not affect the S set.

When a new positive instance is presented, the derivational version space grows
horizontally, so to speak (i.e., the loaf gets more slices, and the slices get
larger). If the newly added positive string has more than one member, there
will be more than one unlabelled simple derivation tree for it. Hence, the
simple tree product will increase in size, and the set of partition lattices
will increase as well (i.e., the loaf gets more slices). Moreover, each of the
new tree sequences is longer than the corresponding old one, because some
unlabelled derivation tree for the new string has been added to it. The new,
longer tree sequence will have more nodes (again, assuming that the string has
more than one member). With more nodes available for partitioning, the
partition lattices will expand. Thus, the loaf's slices get larger. In short,
presenting a positive string increases the number of partition lattices and the
sizes of the partition lattices.

Presenting a new positive string affects the derivational S and G sets in the
following ways. The increase in the number of partition lattices implies that
the derivational S grows, because its members are always the bottom partitions
of the partition lattices. The effect on the derivational G is more subtle. If
there are no negative instances, then G grows because its members are the top
elements of the partition lattices. If there are negative instances, then G may
grow as positive instances are presented, but we have no proof that it must
grow. Although the number of maximal sets grows, the size of the sets may
decrease, leaving the overall G set the same size, or perhaps even decreasing
it.


5.1. The Update algorithm

As mentioned earlier, our algorithm does not bother to maintain the S set,
although it could easily do so. Instead, it maintains P+, the set of positive
strings seen so far. This makes the algorithm more efficient.

The Update algorithm is incremental. It takes an instance and the current
[P+, G] pair, and returns a revision of the pair that is consistent with the
given instance. If there is no such revision, then the algorithm returns some
error value, such as NIL. The following describes the top level of the
algorithm:

1. If the string is positive and a member of P+, then do nothing and return
the current version space. If the string is not a member of P+, then add
it to P+ and call Update-G+.

2. If the string is negative and a member of P+, then return NIL. If the
string is not a member of P+, then call Update-G-.
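
The top level can be sketched as a small dispatch function in Python. This is an illustration only: the two subroutines are passed in as callbacks with an assumed signature `(G, string) -> new G`, `None` stands in for NIL, and all names are hypothetical rather than taken from the paper.

```python
def update(instance, is_positive, positives, G, update_g_plus, update_g_minus):
    """Return the revised (P+, G) pair, or None (NIL) if none exists.

    positives      : frozenset of positive strings seen so far (P+)
    update_g_plus  : hypothetical callback (G, string) -> new G
    update_g_minus : hypothetical callback (G, string) -> new G
    """
    if is_positive:
        if instance in positives:
            return positives, G  # nothing new to learn
        positives = positives | {instance}
        return positives, update_g_plus(G, instance)
    if instance in positives:
        return None  # a string cannot be both positive and negative
    return positives, update_g_minus(G, instance)
```

Because Update-G+ must also test candidates against the negative strings received so far, a fuller version would need to carry those strings along as well; the sketch leaves that to the callbacks.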

The subroutine Update-G- is simpler than Update-G+, so it will be described
first.

The task of Update-G- is to modify G so that none of the grammars will parse
the negative string. The easiest way to do this is with a queue, which is
initialized to contain all the grammars in G. The basic cycle is to pick a
grammar off the queue and see if it parses the negative string. If it does not,
then it can be placed in New-G, the revised version of G. If it does parse the
string, then the algorithm refines the node partition once, in all possible
ways. That is, it takes a partition such as ({1 2 3}, {4 5}), and breaks one of
the partition elements in two. In this case, there are four possible one-step
refinements:

1. {1}, {2 3}, {4 5}

2. {1 2}, {3}, {4 5}

3. {1 3}, {2}, {4 5}

4. {1 2 3}, {4}, {5}

Each of these corresponds to a new grammar. These grammars have the property
that they are FastCovered by the original grammar and there is no grammar that
lies strictly between them and the original grammar. That is, they are just
below the original grammar in the partial order of FastCovers. This process is
called "splitting" in the grammar induction literature (Horning, 1969).
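
The one-step refinements can be enumerated as in the following Python sketch. As an assumption of the example, a partition is a list of frozensets of node numbers, and the function name is hypothetical. The smallest member of a block is kept on one side of each split so that no split is produced twice.

```python
from itertools import combinations


def one_step_refinements(partition):
    """Yield every partition obtained by breaking one block in two."""
    for i, block in enumerate(partition):
        anchor, *rest = sorted(block)
        # Choose which other members stay with the anchor; at least one
        # member must move to the new block, hence k < len(rest).
        for k in range(len(rest)):
            for stay in combinations(rest, k):
                left = frozenset((anchor, *stay))
                right = frozenset(block) - left
                yield partition[:i] + [left, right] + partition[i + 1:]


start = [frozenset({1, 2, 3}), frozenset({4, 5})]
splits = list(one_step_refinements(start))
```

For the partition ({1 2 3}, {4 5}) this produces the four refinements listed above.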

All the grammars produced by splitting are placed on the queue. Eventually, the
new grammars will be taken off the queue, as described above, and tested to see
if they parse the negative string. Those that fail to parse the negative string
are placed in New-G. Such grammars are maximal in the FastCovers order in that
there is no grammar above them that fails to parse the negative string. The
basic cycle of testing grammars, splitting and queuing new grammars continues
until the queue is exhausted. At this point, New-G contains the maximal set of
the grammars that fail to parse the negative string.

There is one technical detail to mention. It is possible for the same grammar
to be generated via several paths. Before queuing a new grammar, the queue and
New-G are checked to make sure the new grammar is not FastCovered by some
existing grammar.
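
The queue-based cycle can be put together in a compact Python sketch. Everything here is an assumption made for illustration: a grammar is a pair (tree sequence, partition), a partition is a frozenset of frozensets of node numbers, a nonterminal is represented by its partition block, and the small recognizer relies on the grammars being simple. The sketch follows the description above and does not attempt to establish maximality independently.

```python
from collections import deque
from functools import lru_cache
from itertools import combinations


def rules_of(trees, partition):
    """Read the rules off the trees; the start symbol is node 1's block."""
    block_of = {node: block for block in partition for node in block}
    rules = set()

    def walk(tree):
        node, *daughters = tree
        rhs = tuple(d if isinstance(d, str) else block_of[d[0]] for d in daughters)
        rules.add((block_of[node], rhs))
        for d in daughters:
            if not isinstance(d, str):
                walk(d)

    for tree in trees:
        walk(tree)
    return block_of[1], rules


def parses(grammar, string):
    """Recognizer for simple grammars (no empty rules, no A -> B rules)."""
    start, rules = rules_of(*grammar)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        if isinstance(symbol, str):  # terminal
            return string[i:j] == symbol
        return any(matches(rhs, i, j) for lhs, rhs in rules if lhs == symbol)

    @lru_cache(maxsize=None)
    def matches(rhs, i, j):
        if len(rhs) == 1:
            return derives(rhs[0], i, j)
        # Every remaining symbol must cover at least one character.
        return any(derives(rhs[0], i, k) and matches(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return derives(start, 0, len(string))


def one_step_refinements(partition):
    """Break one block of the partition in two, in all possible ways."""
    for block in partition:
        anchor, *rest = sorted(block)
        for k in range(len(rest)):
            for stay in combinations(rest, k):
                left = frozenset((anchor, *stay))
                yield (partition - {block}) | {left, block - left}


def fast_covers(x, y):
    (trees_x, part_x), (trees_y, part_y) = x, y
    return trees_x == trees_y and all(
        any(small <= big for big in part_x) for small in part_y)


def update_g_minus(G, negative):
    """Split grammars until none of those kept parses the negative string."""
    queue, new_g = deque(G), []
    while queue:
        grammar = queue.popleft()
        if not parses(grammar, negative):
            new_g.append(grammar)
            continue
        trees, partition = grammar
        for finer in one_step_refinements(partition):
            candidate = (trees, finer)
            # Skip candidates already FastCovered by a known grammar.
            if not any(fast_covers(g, candidate) for g in [*queue, *new_g]):
                queue.append(candidate)
    return new_g


trees = ((1, "b"), (1, (2, "a"), (3, "b")))
top = frozenset({frozenset({1, 2, 3})})
new_g = update_g_minus([(trees, top)], "a")
```

Starting from the top partition of the fourth tree sequence and presenting "a" as a negative string, the sketch keeps the partitions ({1}, {2 3}) and ({1 3}, {2}); the partition ({1 2}, {3}) still parses "a" and its only refinement is already FastCovered.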

New-G should contain only grammars that are (1) simple, (2) consistent with the
positive presentation, (3) consistent with the negative presentation, and (4)
maximal in the FastCovers partial order. The following comments prove that the
Update-G- algorithm satisfies those criteria.

1. The grammars are simple, because the unlabelled derivation trees from
which they are constructed are simple.

2. The grammars are consistent with the positive presentation, because they
are a labelling of a set of derivation trees for those strings. Therefore,
they are guaranteed to parse those strings.

3. The grammars are consistent with the negative string just received,
because the test puts the grammars in New-G only if they fail to parse
that string. The grammars are consistent with the negative strings received
prior to this one, because the grammars from the old G were consistent,
and splitting moves down the FastCovers order, so splitting reduces the
language generated by a grammar and never expands it.

4. The grammars are maximal in the FastCovers order, because splitting
moves down the order one step at a time, and the movement is stopped
as soon as the grammar becomes consistent with the presentation.

This completes the discussion of Update-G-. We now turn to Update-G+, the
function that revises the G set when some of the grammars in it do not parse a
newly received positive string.

The easiest way to explain Update-G+ is to first describe an algorithm that is
not incremental. It takes the whole presentation at once and builds the
appropriate G set. The non-incremental algorithm proceeds in the following
steps:

1. Form the simple tree product by taking the Cartesian product of the
unlabelled simple derivation trees for each positive string.


[Two footnotes appear at this point in the source but are largely illegible.
The legible fragments concern grammars that come from distinct unlabelled tree
sequences, and the fact that the derivational version space is a set of
lattices.]
2. For each such tree sequence in the simple tree product, form a triple to
represent the grammar that has only one nonterminal, S. The partitions for
these grammars all have just one partition element, and the element
contains all the nodes in the derivation tree sequence. These grammars are
the maximal grammars in the FastCovers partial order. They would be the
G set if the presentation had only the positive strings.

3. For each string in the set of negative strings, apply Update-G-.

This algorithm is not incremental, since it must process all the positive strings before it
processes any of the negative strings. An incremental Update algorithm must be able
to handle strings in the order they are received. The incremental algorithm should
take a G set whose grammars may have already been split somewhat by Update-G-
and modify it to accommodate a new positive string.

In the non-incremental algorithm, the effect of adding a new string is to increase
the length of the sequences of unlabelled derivation trees, and hence to increase the
number of nodes in the partitions. In the incremental algorithm, this must be done in
all possible ways, so the resulting Update algorithm is:

1. Given a positive string, form the set of all unlabelled simple derivation trees
for that string.

2. For each grammar in the old G and for each tree for the new positive
string:

a. append the tree onto the end of the tree sequence of the grammar's
triple, and

b. allocate the new tree's nodes to the partition elements in all possible
ways. Thus, if there are N partition elements in the partition, then
there are N choices for where to put the first tree node, N choices
for where to put the second tree node, etc. If the tree has M
nodes, then $N^M$ new partitions will be generated. Each one becomes
a grammar that is a candidate for New-G.

3. Place all the candidate grammars generated in the preceding step on the
queue for the Update-G- algorithm. However, instead of testing that a
grammar is consistent with just one negative string, as the Update-G-
algorithm does, test that the grammar is consistent with all the negative
strings in the presentation that have been received so far.
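Step 2b can be made concrete with a short sketch. It assumes a partition is represented as a list of sets of node numbers; the function name `allocations` and the example node numbers are hypothetical, chosen only to show where the $N^M$ count comes from.

```python
from itertools import product

def allocations(partition, new_nodes):
    """Yield each way of placing every new node into an existing block."""
    for choice in product(range(len(partition)), repeat=len(new_nodes)):
        blocks = [set(block) for block in partition]
        for node, index in zip(new_nodes, choice):
            blocks[index].add(node)
        yield [frozenset(block) for block in blocks]

old_partition = [{1, 2}, {3}]   # N = 2 partition elements
new_nodes = [4, 5, 6]           # M = 3 nodes in the appended tree
candidates = list(allocations(old_partition, new_nodes))
print(len(candidates))          # N ** M candidate partitions
```

Each candidate would then be queued and checked against every stored negative string, as step 3 describes.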

The first two steps generalize the old grammars by adding rules to them. The
new grammars might be too general, in that they may parse some of the negative
strings given earlier in the presentation. Hence the last step must check all the
negative strings. This requires saving all the negative strings as they are presented.
Thus the version space needs to be a triple [P+, P-, G].

This means that one of the usual benefits of the version space technique is lost.
Usually, version space induction allows the learner to forget about an instance after
having processed it. This algorithm requires the learner to remember the instances in
the presentation. However, it is still an incremental algorithm. After each string is
presented, an up-to-date G set is produced. Moreover, it is produced with less
processing and memory than would be required to generate that same G set
completely from scratch using the entire presentation. In short, the algorithm is an
incremental version space update with respect to computation, but not with respect to
instance memory.


5.2. Illustrations of the algorithm's operation

In order to illustrate the operation of the algorithm, this section presents a
simple example. The next section will continue this example by showing how the
algorithm performs when it is modified to incorporate certain biases.

The illustration is based on learning a command language for a file system.
The algorithm receives strings of command words, marked positive or negative, and
from these data, it must infer a grammar for the command language. Suppose the
first string is positive: "delete all-of-them." There are four possible unlabelled simple
trees for this string, and they lead directly to four grammars for the G set. These
grammars are listed below in their triple representation:

1. (1 delete all-of-them)
   {1}
   (S → delete all-of-them)

2. (1 delete (2 all-of-them))
   {1 2}
   (S → delete S) (S → all-of-them)

3. (1 (2 delete) all-of-them)
   {1 2}
   (S → S all-of-them) (S → delete)

4. (1 (2 delete) (3 all-of-them))
   {1 2 3}
   (S → S S) (S → delete) (S → all-of-them)
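The count of four trees can be reproduced with a small enumerator. The sketch below assumes that "simple" means every internal node either has two or more daughters or has exactly one daughter that is a terminal leaf; this reading is taken from the structural constraint quoted in the appendix, and the full definition appears earlier in the paper. The names `cuts` and `simple_trees` and the nested-tuple encoding are hypothetical.

```python
from itertools import product

def cuts(words):
    """Yield every way to cut `words` into two or more contiguous segments."""
    n = len(words)
    for mask in range(1, 2 ** (n - 1)):          # each set bit is a cut point
        segments, start = [], 0
        for pos in range(1, n):
            if mask & (1 << (pos - 1)):
                segments.append(words[start:pos])
                start = pos
        segments.append(words[start:])
        yield segments

def simple_trees(words):
    """Yield unlabelled simple trees over a tuple of words.

    An internal node is a tuple of daughters; a terminal leaf is a string.
    """
    if len(words) == 1:
        yield (words[0],)                        # unary node over a leaf
        return
    for segments in cuts(words):
        options = []
        for segment in segments:
            daughters = list(simple_trees(segment))
            if len(segment) == 1:
                daughters.append(segment[0])     # or the bare terminal itself
            options.append(daughters)
        for daughters in product(*options):
            yield daughters

for tree in simple_trees(("delete", "all-of-them")):
    print(tree)
```

For the two-word string the root must have two daughters, and each daughter is either the bare word or a unary node over it, which gives the four trees listed above.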

The first three grammars generate the finite language consisting only of the
single string "delete all-of-them." The fourth grammar generates all possible strings
over the two-word vocabulary of "delete" and "all-of-them." Suppose the next string
is a negative string, "all-of-them delete." This string cannot be parsed by grammars
1, 2 or 3, so they remain unchanged in the G set. The fourth grammar is overly
general, so it is split. There are only three legal partitions. Two of them survive,
becoming grammars 5 and 6 shown below. The other partition, {1} {2 3}, yields a
grammar that parses the negative string, so it is split further into {1} {2} {3}. This
partition is FastCovered by the two survivors, so it is abandoned. The survivors are:

5. (1 (2 delete) (3 all-of-them))
   {1 2} {3}
   (S → S A) (S → delete) (A → all-of-them)

6. (1 (2 delete) (3 all-of-them))
   {1 3} {2}
   (S → A S) (A → delete) (S → all-of-them)
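The splitting step can be checked mechanically. The sketch below is an illustration only: it reads rules off the tree (1 (2 delete) (3 all-of-them)) for a given partition of its nodes and tests the negative string with a small span-based recognizer. The names `rules_from` and `parses` are hypothetical, and the recognizer assumes the grammar comes from a simple tree, so that no rule rewrites one nonterminal as a single nonterminal.

```python
from functools import lru_cache

TREE = (("delete",), ("all-of-them",))   # nodes 1, 2, 3 in preorder

def rules_from(tree, partition):
    """Label preorder-numbered nodes by partition block; the root's block is S."""
    names, spare = {}, iter("ABCDEFGH")
    for block in sorted(partition, key=min):
        name = "S" if 1 in block else next(spare)
        for node in block:
            names[node] = name
    rules, counter = set(), [0]

    def walk(node):
        counter[0] += 1
        me = counter[0]                  # preorder number of this node
        rhs = tuple(walk(kid) if isinstance(kid, tuple) else kid for kid in node)
        rules.add((names[me], rhs))
        return names[me]

    walk(tree)
    return rules

def parses(rules, words):
    """True if S derives the word sequence under `rules`."""
    words = tuple(words)
    heads = {lhs for lhs, _ in rules}

    @lru_cache(maxsize=None)
    def covers(symbols, i, j):
        """True if the symbol sequence derives words[i:j]."""
        if not symbols:
            return i == j
        first, rest = symbols[0], symbols[1:]
        for k in range(i + 1, j - len(rest) + 1):
            if first in heads:
                ok = any(covers(rhs, i, k) for lhs, rhs in rules if lhs == first)
            else:
                ok = k == i + 1 and words[i] == first
            if ok and covers(rest, k, j):
                return True
        return False

    return covers(("S",), 0, len(words))

NEGATIVE = ["all-of-them", "delete"]
for partition in ([{1, 2}, {3}], [{1, 3}, {2}], [{1}, {2, 3}]):
    print(partition, parses(rules_from(TREE, partition), NEGATIVE))
```

Of the three two-block partitions, only {1} {2 3} yields a grammar that accepts the negative string, which is why it alone is split further.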

Suppose the next string is "delete delete," a negative instance. None of the
grammars in G parse this string, so the G set remains unchanged. This illustrates
that the algorithm has made an inductive leap while processing the preceding strings. This
string is new, but there is no change in the version space.

Suppose the next string is positive, "delete it." There are four possible
unlabelled simple derivation trees for this string. Each is paired with each of the five
grammars in the current G, yielding 20 combinations. The resulting 20 grammars are
queued for testing against P-. Some splitting occurs during the processing of the
queue. When the queue is finally exhausted, New-G has 25 grammars.

Table 5-1 summarizes the results so far, and shows what happens as more
instances are presented. As a rough indication of the practicality of the algorithm, the
table shows the number of CPU seconds used in processing each instance by a Xerox
1109 running Interlisp. The combinatorial explosion inherent in the Update-G+
algorithm is quite evident. However, the algorithm is fast enough to construct small
version spaces.


Table 5-1: Learning a command language

Instances               Size of G set    CPU seconds
+ delete all-of-them                4           0.03
- all-of-them delete                5           0.25
- delete delete                     5           0.52
+ delete it                        25          11.80
- it it                            25           0.32
+ print it                        197         526.00
+ print all-of-them              2580       20300.00

5.3. Biasing the Update algorithm

Better performance can be obtained by using the Update algorithm as a
framework upon which biases can be mounted. There are several places in the
algorithm where biases can be installed. One place is in the queue-based loop of
Update-G-. Currently, new grammar triples are placed on the queue only if they are
not FastCovered by existing grammar triples. This filter can be made stronger. For
instance, suppose we queue only grammars that have a minimal number of
nonterminals, that is, grammar triples with partitions of minimal cardinality. Table 5-2
shows the results of running the previous illustration with this bias installed.




Table 5-2: A bias for minimum number of nonterminals

Instances               Size of G set    CPU seconds
+ delete all-of-them                4           0.03
- all-of-them delete                3           0.28
- delete delete                     3           0.05
+ delete it                         7           1.66
- it it                             7           0.09
+ print it                         17           5.87
+ print all-of-them                55          19.40

The bias reduces the G set from 2580 grammars to 55 grammars. All of these
grammars happen to use a single nonterminal, e.g.,

S → S all-of-them
S → S it
S → delete
S → print

Processing time is drastically reduced, since many grammar triples (those with
partitions having cardinality larger than that of some existing consistent grammar triple)
are not even generated.

Another filter that can be placed on the Update-G- loop is one which limits how
deeply into the partition lattice the search may delve. We implemented a filter which
allows the user to set a "ply." If a grammar triple with partition of cardinality m
needs to be split, the search will proceed only to partitions of cardinality m+n, where
n is the ply set by the user. Table 5-3 indicates that this bias approximates the
results of the unbiased algorithm more closely than does minimizing the number of
nonterminals. Note especially that for a ply of two, all the grammars of the unbiased
algorithm were produced at a fraction of the processing time.
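The effect of a ply limit on the partition lattice can be sketched as follows. This is a simplified illustration, not the filter the authors implemented: it enumerates every partition reachable within a given number of splits, whereas the actual Update-G- loop splits only grammars that parse a negative string. The names `split_once` and `within_ply` are hypothetical.

```python
from itertools import combinations

def split_once(partition):
    """Yield every partition made by cutting one block into two non-empty parts."""
    for block in partition:
        rest = [b for b in partition if b != block]
        members = sorted(block)
        anchor, others = members[0], members[1:]
        # Keep the smallest member on the left so each cut is produced once.
        for size in range(len(others)):
            for left_extra in combinations(others, size):
                left = frozenset((anchor,) + left_extra)
                right = block - left
                yield frozenset(rest + [left, right])

def within_ply(partition, ply):
    """Partitions reachable from `partition` by at most `ply` splits."""
    frontier, seen = {partition}, set()
    for _ in range(ply):
        frontier = {q for p in frontier for q in split_once(p)} - seen
        seen |= frontier
    return seen

start = frozenset([frozenset({1, 2, 3})])     # cardinality m = 1
print(len(within_ply(start, 1)))              # partitions of cardinality 2
print(len(within_ply(start, 2)))              # adds the cardinality-3 partition
```

For the three-node partition of grammar 4, a ply of one reaches the three legal two-block partitions mentioned in the illustration, and a ply of two also reaches {1} {2} {3}.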

Table 5-3: The effects of limiting the splitting ply

                              1-ply                 2-ply
Instances                  G       Secs.         G       Secs.
+ delete all-of-them       4        0.02         4        0.01
- all-of-them delete       5        0.39         5        0.54
- delete delete            5        0.09         5        0.08
+ delete it               25        6.39        25       13.50
- it it                   25        0.31        25        0.32
+ print it               188       75.20       197      204.00
+ print all-of-them     2406     1110.00      2580     3460.00

The desirability of these biases will depend on the task domain. The point is
only that the algorithm provides a convenient framework for implementing such
biases.

Another place to install biases is in the subroutine of Update-G+ that generates
unlabelled derivation trees. This placement allows the biases to control the form of
the grammars. For example, if the tree generator produces only binary trees, then the
induced grammars are in Chomsky normal form. If the tree generator is constrained to
produce only right-branching trees, then only regular grammars are considered. In the
latter case, there is only one right-branching simple tree for each string.
Consequently, there is only one unlabelled tree sequence for any given presentation.
Under these circumstances, FastCovers is equivalent to Covers, and our algorithm
becomes similar to Pao's (1969) algorithm for learning finite state machines. The main
difference is that Pao's algorithm employs an explicit representation for the whole
version space, whereas our algorithm uses the more compact [P+, P-, G] representation.
Table 5-4 shows the results of our algorithm on the test case discussed above when
the bias for regular grammars is introduced. Tables 5-5 and 5-6 show the results for a
more challenging case, inferring the syntax of Unix file names.

Table 5-4: Inducing regular grammars for file system commands

Instances               Size of G set    CPU seconds
+ delete all-of-them                1           0.07
- all-of-them delete                1           0.02
- delete delete                     1           0.01
+ delete it                         1           0.07
- it it                             1           0.02
+ print it                          1           0.11
+ print all-of-them                 1           0.12

Table 5-5: Learning regular grammars for Unix file names

Instances                   G set size    CPU seconds
+ foo . bar                          1           0.01
+ foo                                1           0.01
+ bar                                1           0.01
+ / gram / foo                       1           0.04
- foo / / foo                        5           1.03
- / / foo                            1           0.32
+ / usr / vsg / bar                 43          88.30
- / / bar                           32          20.90
- / / / bar                          ?           9.49
- / usr / / bar                     16          19.50
+ / gram / bar                      32           9.76
- / / vsg / bar                     14           6.20
- vsg / / usr / bar                 22          10.10
- / usr / / gram / bar              15          17.20
- / usr / / foo                     10          10.40
- / usr / / gram / foo               5           7.55
- / usr / / vsg / ba                 2           8.72
Table 5-6: The contents of the final G set of Table 5-5

Grammar 1           Grammar 2
S → / A             S → / A
A → gram S          A → gram S
A → usr S           A → usr S
A → vsg S           A → vsg S
S → / foo           S → / foo
S → / bar           S → / bar
S → . bar           A → . bar
S → foo             S → foo
S → bar             S → bar
S → foo S           S → foo A

The point of this section is that the Update algorithm is good for more than just
exploring induction problems. It can be used as a framework for the development of
practical, task-specific learning machines.


6.  Conclusions 

The introduction motivated the results presented here from a narrow perspective,
viz. their utility in our research on cognitive skill acquisition. However, they may have
wider application. This section relates the results to more typical applications of
grammar induction, which, for purposes of discussion, can be divided into three
classes:

• Grammar induction is used as a formal account of natural language
acquisition (Osherson, Stob & Weinstein, 1985; Berwick, 1985; Langley &
Carbonell, 1986; Pinker, 1979). Learning the syntax of a language is
regarded by some as an important component of learning a language, and
grammar induction is one way to formalize the syntactic component of the
overall language-learning task.

• Grammars are sometimes used in software engineering, e.g., for command
languages or pattern recognition templates (Gonzalez & Thompson, 1978).
Some applications require that the grammars change over time or in
response to different external situations. For instance, a command
language could be tailored for an individual user, or a pattern recognizer
might need to learn some new patterns. Grammar induction may be useful
in such applications (Fu & Booth, 1975; Biermann & Feldman, 1972).

• Knowledge bases in AI programs often have recursive hierarchical
structures, such as calling structures for plans or event schemata for
stories. The hierarchical component of such knowledge is similar to a
grammar. Grammar induction can be used to acquire the hierarchical
structure, although it must be used in combination with other induction
engines that acquire the other structures. For instance, the Sierra learner
(VanLehn, 1987) represents knowledge as hierarchical production rules. It
uses a grammar induction algorithm to learn the basic skeleton of its rules,
a Bacon-like function inducer (Langley, 1979) to learn details of the rules'
actions, and a version space algorithm (Mitchell, 1982) to learn the exact
content of the rules' conditions.

In all these applications, the problem is to find an algorithm that will infer an
"appropriate" grammar whenever it receives a "typical" training sequence. The
definition of "appropriate grammar" and "typical training sequence" depends, of
course, on the task domain. However, it is usually the case that only one grammar is
desired for any given training sequence. If so, then the derivational version space Update
algorithm is not immediately applicable, because it produces a set of grammars. In
fact, the set tends to grow larger as the training sequence grows longer. This is why
we do not claim that the algorithm models human learning, even though it was
developed as part of a study of human learning. The algorithm produces a set,
whereas a person probably learns only one (or a few) grammars qua procedures.
Similarly, the algorithm is not a plausible model of how children learn natural
language, even in the liberal sense of "plausible" employed by studies of language
identification in the limit.¹³

Of course, not all applications of grammar induction desire an algorithm to
produce just one grammar. An application might have an inducer produce a set of
grammars, and leave some other process to choose among them. For instance, when
tailoring a command language, one might have the user choose from a set of
grammars generated by the grammar inducer.

If grammar induction is viewed as search, then producing just one grammar is a
form of depth-first or greatest-commitment search. The version space strategy, as
pointed out by Mitchell (1982), is a form of breadth-first or least-commitment search.
This gives it the capability of producing informative intermediate results. Halfway
through the presentation of the training sequence, one can determine unequivocally
which generalizations have been rejected and which generalizations are still viable. If
this capability, or any other feature of the least-commitment approach, is important,
then the algorithm presented here should be considered.

Even if the application doesn't use grammars as the representation language, or
even as a component of the representation, the techniques presented here may be
useful. The definition of reducedness extends readily to many representation
languages. For instance, a Prolog program is reduced if deleting any of its Horn
clauses makes the program inconsistent with the training examples. Similarly, the
technique for building a derivational version space can be extended to representations
other than grammars. It remains to be seen whether there is any utility in these
extensions, but they are possible at least.


13. It might seem that after a large number of examples had been received, all the grammars in the
derivational version space will generate exactly the same language, and that language is the target
language. This would imply that our algorithm identifies languages in the limit, even though it produces
a set of grammars instead of a single grammar. Kevin Kelly and Clark Glymour (personal communication)
have proved this conjecture false. Kelly (personal communication) has shown that any identifiable class
of context-free grammars is identifiable by a machine whose conjectures are always reduced, simple
grammars that are consistent with the data. Kelly and Glymour conclude, along with us, that our results
have little bearing on the language identification literature.
However, the applications that seem most likely to benefit from the derivational
version space approach are those that are most similar to our application, in that their
research problem is to understand the influence of representation languages and
biases on learning. This amounts to studying the properties of induction problems,
rather than studying the properties of induction algorithms. In such research it is
often a useful exercise to study the set of generalizations consistent with a given
training sequence and see how that set changes as the biases and representation
language are manipulated. Such sets are exactly what the derivational version space
strategy calculates.
7. Appendix

7.1. Proof of Theorem 3

The following proof shows that there are finitely many reduced context-sensitive
grammars for any given finite presentation. Context-sensitive grammars are used in
the theorem not only because they give the theorem broader coverage, but because
they make the proof simpler. The proof is a counting argument, and context-sensitive
grammars are defined by counting the relative sizes of the left and right sides of their
rules.


Definition: A grammar is a context-sensitive grammar if for all rules
$\alpha \to \beta$, we have $|\alpha| \le |\beta|$, where $|x|$ means the length of string $x$.¹⁴

Definition: A context-sensitive grammar is simple if (1) for all rules
$\alpha \to \beta$, $|\alpha| = |\beta|$ implies that $\beta$ has more terminals than $\alpha$, and (2) every
nonterminal occurs in some derivation of some string.

Lemma: The longest derivation of a string $s$ using a simple context-sensitive
grammar has length at most $2|s| - 1$.

Proof: Consider an arbitrary step in the derivation, $\alpha \to \beta$. If $|\alpha| = |\beta|$, then $\beta$
must contain at least one more terminal than $\alpha$ because the grammar is a simple
one. Consequently, there can be at most $|s|$ such steps in the derivation, because
there are $|s|$ terminals in the string. For all the other steps in the derivation, $\beta$ must
be at least one longer than $\alpha$. There can be at most $|s| - 1$ such steps in the
derivation, because the string is only $|s|$ long. So the longest possible derivation
using a simple grammar is $2|s| - 1$, where $|s|$ steps have $|\alpha| = |\beta|$ and $|s| - 1$ steps
have $|\alpha| < |\beta|$.

Theorem 3: There are finitely many simple reduced context-sensitive
grammars for any given finite presentation.

Proof: There are finitely many positive strings in the presentation. By the
lemma just proved, the longest derivation of each string $\sigma$ is $2|\sigma| - 1$. Therefore the
largest number of rule firings in deriving the positive strings is less than $2T$, where $T$
is the total of the lengths of positive strings:

$$T = \sum_{\sigma \in P^{+}} |\sigma|$$

where $P^{+}$ designates the set of positive strings in the presentation. If a grammar has
more than $2T$ rules, then there must be rules that were not used in any string's
longest derivation. Such rules can be eliminated from the grammar without affecting
the existence of the longest derivations. So such a grammar is reducible. Thus, the
largest possible reduced grammar will have less than $2T$ rules.


14. Another definition of context-sensitive grammars requires that the rules have the form
$\alpha A \beta \to \alpha \gamma \beta$, where $A$ is a nonterminal and $\gamma$ is nonempty. The definition given above is from
Hopcroft and Ullman (1969), who comment that the two definitions are equivalent.
This result alone is not enough to show that there are finitely many reduced
grammars, because the rules could in principle be arbitrarily long or contain arbitrarily
many distinct symbols. However, if a rule is used in a derivation of some string in
P+, its right side can be no longer than L, where L is the length of the longest string
in P+. The rule's left side can also be no longer than L. If we could show that the
grammar cannot have arbitrarily many symbols, then we would be done. Because the
terminals in the rules must appear in the strings and there are finitely many strings,
there are finitely many possible terminals in the rules. In fact, the largest number of
terminals is T. Because each nonterminal must participate in some string's derivation
(by simplicity), each nonterminal must appear in some rule's left side. There are less
than 2T rules, each with at most L symbols on the left side, so there are less than
2LT possible nonterminals. Thus, the number of simple reduced grammars is finite
because: (1) the number of rules is less than 2T, (2) the length of the left and right
sides is at most L, and (3) there are less than 2LT + T symbols used in the rules.
This completes the proof of Theorem 3.
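As a worked instance of the counting argument, the bounds can be computed for a small set of positive strings. The function name `theorem3_bounds` and the choice of example strings (the positive instances from the command-language illustration) are assumptions made here purely for illustration.

```python
def theorem3_bounds(positive_strings):
    """Return the counting bounds used in the finiteness argument."""
    lengths = [len(s) for s in positive_strings]
    T = sum(lengths)                      # total length of the positive strings
    L = max(lengths)                      # length of the longest positive string
    firings = sum(2 * n - 1 for n in lengths)   # lemma: 2|s| - 1 per string
    return {
        "T": T,
        "L": L,
        "firings": firings,               # always less than 2T
        "max_rules": 2 * T,               # fewer than 2T rules
        "max_nonterminals": 2 * L * T,    # fewer than 2LT nonterminals
        "max_symbols": 2 * L * T + T,     # fewer than 2LT + T symbols
    }

positives = [("delete", "all-of-them"), ("delete", "it"), ("print", "it")]
print(theorem3_bounds(positives))
```

The bounds are loose, but the proof needs only that they are finite.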


7.2. Proof of Theorem 5

Theorem 5 states that the derivational version space contains the reduced version
space. The critical part of the proof deals with the positive strings, P+, because
both version spaces specifically exclude grammars that generate negative strings.
Hence, we prove the following theorem, from which Theorem 5 follows immediately.

Algorithm A: Given a set of strings P+, produce the grammars
corresponding to all possible labellings of all possible sequences of all
possible simple unlabelled derivation trees for each string.

Theorem 11: Algorithm A produces a set of grammars that contains all
the reduced, simple grammars for P+.

To prove the theorem, we need to show that every reduced grammar is
generated by the algorithm. Suppose that R is a reduced grammar. Because R
generates every string in P+, it generates at least one derivation tree for every string
in P+. First we will consider the case where R generates exactly one derivation tree
for each string, then consider the case where R generates more than one derivation
tree for some (or all) of the strings.

Since R generates exactly one derivation tree for each string, we merely need to
show that that sequence of derivation trees is among the set of labelled parse tree
sequences generated by the algorithm. Because R is simple, its parse trees must
conform to the structural constraints that were imposed in generating the set of
unlabelled derivation trees. In other words, if a node has just one daughter, then the
daughter is a leaf. Moreover, the algorithm generates all such unlabelled derivation
trees, so R's derivation trees must be among those generated by the algorithm. R's
derivation trees are thus some labelling of one of the unlabelled derivation tree
sequences. However, the algorithm generates all possible labellings of these. So R's
parse trees must be among the set of labelled derivation tree sequences generated by
the algorithm. The only way for R to have nonterminals other than those induced by
labelling the unlabelled derivation trees would be for R to have rules that are not used
in generating its derivation trees for P+. However, R is reduced, so this cannot be
the case. Thus, R must be one of the grammars induced by the algorithm.

Next we consider the case where R generates more than one derivation tree for
at least one string in P+. For each string p in P+, let $\mathit{Trees}_p$ be the set of
derivation trees for p generated by R. As shown above, the algorithm generates at
least one of the derivation trees in each $\mathit{Trees}_p$. From that derivation tree sequence, it
generates rules for a grammar. The generated grammar will not be the same as R if
some of the other derivation trees in $\mathit{Trees}_p$ use rules that are not in this derivation
tree sequence. However, if this were the case, then those rules could be deleted
from R, and yet all the strings could still be parsed. Thus, R would be reducible,
contrary to hypothesis. So R must be generated by the algorithm. This completes
the proof of the theorem.


7.3. Proof of Theorem 10

The following theorem states that the derivational version space is properly
represented by its boundaries:

Theorem 10: Given a grammar x in triple form and a derivational [G, S], x
is in the derivational version space represented by [G, S] if and only if there
is some g in G such that g FastCovers x, and some s in S such that x
FastCovers s.

The "only if" half of the "if and only if" follows immediately from the definition of G and
S: if x is in the space, then there must be maximal and minimal elements above and
below it. To show the "if" half, suppose that x isn't in the space, and yet there
is a g that FastCovers it and an s that it FastCovers. A contradiction will be derived
by showing that x should be in the derivational version space. First, we show that x
is consistent with the presentation. Because x FastCovers s, the language generated
by x includes the language generated by s. Because s's language includes the
positive strings of the presentation, so does x's language. Thus, x is consistent with
the positive strings of the presentation. Because g FastCovers x, and g's language
excludes all the negative strings, x's language must also exclude all the negative
strings. So x is consistent with the negative strings. Therefore, x is consistent with
the whole presentation. The remaining requirement for membership in the derivational
version space is that x be a labelling of some tree sequence from the simple tree
product of the presentation. Clearly, x is a labelling of the tree sequence which is
the first element of its triple. Because x FastCovers s, it must have the same tree
sequence as s, so its tree sequence is a member of the simple tree product. It
follows that x should be in the derivational version space. This contradiction
completes the proof of the theorem.
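The boundary test of Theorem 10 can be sketched in a few lines. The sketch rests on an assumption made here, not on the paper's formal definition: that for two grammars sharing a tree sequence, FastCovers amounts to one partition being a refinement of the other. Grammars are represented as (tree sequence, partition) pairs, omitting the rule set, which is determined by the other two elements. The names `refines`, `fast_covers`, and `in_version_space`, and the example S set, are hypothetical.

```python
def refines(fine, coarse):
    """True if every block of `fine` lies inside some block of `coarse`."""
    return all(any(block <= big for big in coarse) for block in fine)

def fast_covers(upper, lower):
    """Assumed reading of FastCovers on (tree_sequence, partition) pairs."""
    return upper[0] == lower[0] and refines(lower[1], upper[1])

def in_version_space(x, G, S):
    """Boundary test: some g in G covers x, and x covers some s in S."""
    return (any(fast_covers(g, x) for g in G)
            and any(fast_covers(x, s) for s in S))

# Tree sequence holding the single tree (1 (2 delete) (3 all-of-them)).
SEQ = ((("delete",), ("all-of-them",)),)
g5 = (SEQ, [{1, 2}, {3}])
g6 = (SEQ, [{1, 3}, {2}])
s = (SEQ, [{1}, {2}, {3}])            # hypothetical minimal element
rejected = (SEQ, [{1}, {2, 3}])       # the partition that parses the negative string

print(in_version_space(g5, [g5, g6], [s]))
print(in_version_space(rejected, [g5, g6], [s]))
```

Under this reading, the partition {1} {2 3} from the illustration fails the test because neither survivor covers it.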